Compounding and Derivational Morphology in a Finite-State Setting
Abstract
This paper proposes applying finite-state approximation techniques to a unification-based grammar of word formation for a language like German. It presents a refinement of an RTN-based (recursive transition network) approximation algorithm that extends the state space of the automaton by selectively adding distinctions based on the parsing history at the point of entering a context-free rule. The selection of history items exploits the specific linguistic nature of word formation. As experiments show, this algorithm avoids an explosion of the size of the automaton in the approximation construction.

1 The locus of word formation rules in grammars for NLP

In English orthography, compounds following productive word formation patterns are spelled with spaces or hyphens separating the components (e.g., classic car repair workshop). This is convenient from an NLP perspective, since most aspects of word formation can be ignored from the point of view of the conceptually simpler token-internal processes of inflectional morphology, for which standard finite-state techniques can be applied. (Let us assume that, to a first approximation, spaces and punctuation are used to identify token boundaries.) It also makes it very easy to access one or more of the components of a compound (like classic car in the example), which is required in many NLP techniques (e.g., in a vector space model).

If an NLP task for English requires detailed information about the structure of compounds (as complex multi-token units), it is natural to use the formalisms of computational syntax for English, i.e., context-free grammars, or possibly unification-based grammars. This makes it possible to deal with the bracketing structure of compounding, which would be impossible to cover in full generality in the finite-state setting.
In languages like German, spelling conventions for compounds do not support such a convenient split between sub-token processing based on finite-state technology and multi-token processing based on context-free grammars or beyond. In German, even very complex compounds are written without spaces or hyphens: words like Verkehrswegeplanungsbeschleunigungsgesetz (‘law for speeding up the planning of traffic routes’) appear in corpora. So, for a fully adequate and general account, the token-level analysis in German has to be done at least with a context-free grammar:[1] checking the selection features of derivational affixes requires, in the general case, a tree or bracketing structure. For instance, the prefix Fehl- combines with nouns (compare (1)); however, it can appear linearly adjacent to a verb, including the verb's own prefix, and only then do we get the suffix -ung, which turns the verb into a noun.
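To illustrate why checking selection features needs a bracketing structure, here is a minimal sketch of a chart (CKY-style) recognizer over morpheme sequences. It is not the grammar or the approximation algorithm of the paper. The toy lexicon, the category names (FEHL, VPREF, V, UNG, N), the three binary rules, and the segmentation fehl + be + setz + ung (for Fehlbesetzung) are all assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical toy lexicon (illustration only): morpheme -> category.
LEXICON = {
    "fehl": "FEHL",   # prefix that selects a noun
    "be": "VPREF",    # verbal prefix
    "setz": "V",      # verb stem
    "ung": "UNG",     # suffix that turns a verb into a noun
}

# Hypothetical binary word formation rules: (left, right) -> result category.
RULES = {
    ("VPREF", "V"): "V",   # prefixed verb
    ("V", "UNG"): "N",     # deverbal noun
    ("FEHL", "N"): "N",    # Fehl- attaches to a noun
}


def parse(morphemes):
    """Return {category: bracketing} for the whole morpheme sequence.

    A bracketing is a nested tuple of morphemes. Only the first bracketing
    found per category is kept; an empty dict means no analysis exists.
    """
    n = len(morphemes)
    if n == 0:
        return {}
    chart = {}
    # Width-1 spans come straight from the lexicon.
    for i, morpheme in enumerate(morphemes):
        chart[(i, i + 1)] = {LEXICON[morpheme]: morpheme}
    # Wider spans are built by combining two adjacent sub-spans.
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            cell = {}
            for mid in range(start + 1, end):
                left, right = chart[(start, mid)], chart[(mid, end)]
                for (lcat, ltree), (rcat, rtree) in product(left.items(), right.items()):
                    cat = RULES.get((lcat, rcat))
                    if cat is not None and cat not in cell:
                        cell[cat] = (ltree, rtree)
            chart[(start, end)] = cell
    return chart[(0, n)]


if __name__ == "__main__":
    print(parse(["fehl", "be", "setz", "ung"]))
    print(parse(["fehl", "be", "setz"]))
```

In this sketch, fehl is linearly adjacent to the verbal prefix be, yet the only analysis attaches it to the noun built by -ung, giving the bracketing (fehl, ((be, setz), ung)). Without the suffix, the sequence fehl + be + setz receives no analysis, because the noun that Fehl- selects is never formed.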